Tracking human physical activity using smartphones is an emerging trend in healthcare monitoring and healthy lifestyle management. Neural networks are widely used to analyze inertial sensor data for activity recognition. Inspired by autoencoder neural networks, we propose a layer-wise network, namely the principal coefficient encoder model (PCEM). Unlike vanilla neural networks, which apply random weight initialization and back-propagation for parameter updating, PCEM implements an optimized weight initialization via principal coefficient learning. This principal coefficient encoding allows rapid data learning with no back-propagation and no extensive hyperparameter tuning. In PCEM, the leading principal coefficients of the training data are taken as the network weights. Two hidden layers with principal coefficient encoding are stacked in PCEM to form a deep architecture. The performance of PCEM is evaluated under a subject-independent protocol, in which training and testing samples come from different users, with no overlapping subjects between the training and testing sets. This protocol better assesses how well the model generalizes to new data. Experimental results show that PCEM outperforms several state-of-the-art machine learning and deep learning models, including a convolutional neural network and a deep belief network. PCEM achieves ~97% accuracy in subject-independent human activity analysis.
1. INTRODUCTION


Human activity recognition (HAR) has become more prominent with the growth and advancement of the smart home concept, healthcare monitoring and healthy lifestyle management [1]. Remote healthcare monitoring tools on smartphones are widely used in remote mobile health monitoring (RMHM) systems. Smartphones suit HAR because their multiple inertial sensors can capture motion data and because they are closely attached to our daily lives [2]. With a simple installation process, a HAR app can be activated on a smartphone and run in the background to track our physical activity.

Smartphone-based HAR is usually developed in four phases: data acquisition, data segmentation, feature extraction and classification. During data acquisition, factors such as the position of the smartphone and the sampling frequency are considered. In the literature, different smartphone placements have been investigated, e.g. in the front pocket, in the back pocket, in the hand, on the waist, while typing, and while phoning [1]-[4]. Multiple sampling frequencies have also been explored for data collection [4]-[6]. The data segmentation size influences recognition performance to some degree; thus, various sliding window lengths (2.56 s, 1 s, 6.7 s, 7.5 s and 10 s) have been examined during the data segmentation phase [1], [3], [7].

Once the data are segmented, salient features are derived from the samples; this phase is known as feature extraction. Various feature extraction techniques have been explored, ranging from hand-crafted features to deep learning approaches [8]-[10]. Cao et al. [9] proposed a group-based context-aware classification method for HAR on smartphones, namely group-based context-aware human activity recognition (GCHAR). GCHAR adopts a hierarchical group-based scheme that reduces misclassification through context-awareness rather than intensive computation. Inter-group and intra-group hierarchical classification schemes are constructed to reduce the prediction load at each classification level, and context-awareness is used to detect transitions between activities. Compared with RandomTree, Bagging, J48 and BayesNet, GCHAR showed better model training efficiency and classification performance. Ahmed et al. [11] proposed a hybrid feature selection approach for HAR that combines filter and wrapper methods. In this approach, a sequential floating forward search is used to obtain optimal features, which are then classified with a support vector machine (SVM). Experimental results showed that this feature selection-based system achieved 6% higher accuracy than the system without feature selection. The main drawback of such hand-crafted techniques is that feature engineering requires prior expert knowledge or extensive empirical study.

In recent years, numerous deep learning methods have been proposed [6], [12]-[14]. Deep learning methods extract features automatically and produce increasingly abstract data representations as the network grows deeper. For instance, techniques based on convolutional neural networks (CNN) [1], [15], autoencoders [16], deep belief networks (DBN) [17], recurrent neural networks (RNN) [5] and long short-term memory (LSTM) [18], [19] have been explored extensively to extract deep features from motion inertial signals for HAR. Appropriate feature selection strongly influences the recognition rate of the entire system.

The last step of HAR is classification, in which a classification model is built on the feature set extracted in the previous step. Various machine learning approaches, either a single standalone classifier or a fusion of multiple classifiers, have been widely studied. Notable classifiers such as decision tree (DT), multilayer perceptron (MLP), SVM, random forest (RF), logistic regression and extreme gradient boosting (XGB) have been examined in smartphone-based HAR [2], [3], [20], [21]. The strong performance of deep learning methods in pattern recognition is undeniable. However, their success depends on large numbers of training samples for model generalization and on high-performance hardware to support expensive computational loads [22]. Moreover, deep neural networks require parameters to be initialized and tuned, e.g. random weight initialization followed by back-propagation weight updates. In contrast to these classic deep learning networks, we propose a deep analytic model that analyzes human motion data with fast weight initialization and without back-propagation. Inspired by the autoencoder architecture, the proposed system is a layer-wise network, namely the principal coefficient encoder model (PCEM). The main contributions of this paper are:

a. A subject-independent human activity recognition approach based on a deep analytic model that generalizes well to new data without a massive training set. In PCEM, the model is trained on data samples from one group of subjects, while its efficacy is tested on samples from another group of subjects.

b. An optimized principal coefficient weight initialization in the neural layers, which allows quick data learning with no back-propagation and no extensive hyperparameter tuning. Unlike classic deep learning systems, which require a graphics processing unit (GPU) for computation, the principal coefficient encoder model (PCEM) can be trained on a central processing unit (CPU) alone, owing to its light computation.

c. An extensive performance analysis with two different classification modes. The effectiveness of PCEM is 
examined in two-class classification (i.e. distinguishing active and passive physical activities) and 
multiclass classification (i.e. distinguishing the types of each activity). 


2. THE PROPOSED METHOD 

In this work, smartphone-embedded sensors are used to perform human activity recognition. The overall architecture of the proposed PCEM is depicted in Figure 1. It comprises four stages: data acquisition, data preprocessing, feature extraction via principal coefficient encoding, and data classification.
Figure 1. The architecture of the proposed PCEM


2.1. Data preprocessing 

The acceleration and angular velocity signals are preprocessed via a low-pass filter to reduce noise. 
A sliding window of 2.56 s and fifty percent overlap is implemented to segment the waveform signals. The 
segmented data is further processed to generate meaningful feature variables as illustrated in Figure 2. 
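The segmentation step can be sketched as follows. This is a minimal illustration, not the authors' MATLAB implementation: it assumes the 50 Hz sampling rate reported for the HSD database in Section 3, so a 2.56 s window spans 128 samples and a 50% overlap gives a 64-sample hop. The low-pass filter is omitted because its design is not specified here, and the function name `sliding_windows` is hypothetical.

```python
import numpy as np

def sliding_windows(signal, fs=50.0, window_s=2.56, overlap=0.5):
    """Split a (T,) or (T, channels) signal into fixed-length overlapping windows.

    Returns an array of shape (n_windows, window_len, channels); trailing samples
    that do not fill a complete window are dropped.
    """
    x = np.asarray(signal, dtype=float)
    if x.ndim == 1:
        x = x[:, None]                                        # treat 1-D input as one channel
    window_len = int(round(window_s * fs))                    # 2.56 s * 50 Hz = 128 samples
    step = max(1, int(round(window_len * (1.0 - overlap))))   # 50% overlap -> 64-sample hop
    if x.shape[0] < window_len:
        return np.empty((0, window_len, x.shape[1]))
    n_windows = 1 + (x.shape[0] - window_len) // step
    starts = np.arange(n_windows) * step
    return np.stack([x[s:s + window_len] for s in starts])
```

For example, 10 s of tri-axial data (500 samples) yields six windows of shape 128 × 3. Dropping the incomplete tail is only one of several possible conventions.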


Figure 2. Process data into feature variables (features shown: average value; standard deviation; median absolute value; largest value in array; signal magnitude area; average sum of …; signal entropy; auto-regression coefficients; correlation coefficient; largest frequency component; frequency signal skewness; frequency signal kurtosis)
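A few of the Figure 2 features can be computed per window as sketched below. The exact feature definitions are not given in the text, so the formulas here are assumptions: signal magnitude area is taken as the time average of |x| + |y| + |z|, and the largest frequency component as the FFT bin with the largest magnitude after removing the mean. `window_features` is a hypothetical helper that covers only a subset of the listed features.

```python
import numpy as np

def window_features(window, fs=50.0):
    """Compute an illustrative subset of Figure 2 features for one (N, 3) window."""
    w = np.asarray(window, dtype=float)
    feats = {
        "mean": w.mean(axis=0),                          # average value per axis
        "std": w.std(axis=0),                            # standard deviation per axis
        "max": w.max(axis=0),                            # largest value per axis
        "sma": np.abs(w).sum(axis=1).mean(),             # signal magnitude area (assumed definition)
        "corr_xy": np.corrcoef(w[:, 0], w[:, 1])[0, 1],  # correlation between x and y axes
    }
    # Largest frequency component: FFT bin with the largest magnitude, per axis
    spectrum = np.abs(np.fft.rfft(w - w.mean(axis=0), axis=0))
    freqs = np.fft.rfftfreq(w.shape[0], d=1.0 / fs)
    feats["dominant_freq_hz"] = freqs[spectrum.argmax(axis=0)]
    return feats
```

For a 128-sample window at 50 Hz the frequency resolution is 50/128 ≈ 0.39 Hz, so the dominant-frequency estimate is quantized to that grid.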


2.2. Feature extraction: principal coefficient encoding 

In an autoencoder, the input x is multiplied by a weight matrix W and a bias vector b is added. The weights and biases are initialized randomly and then updated iteratively through back-propagation. A nonlinear activation function f is then applied to obtain the encoder's output, called the code y.


$$y = f(Wx + b) \tag{1}$$


In the principal coefficient encoder, the input data x are first normalized by subtracting the mean vector m, which holds the mean of each data dimension. The resulting data are then multiplied by an orthogonal matrix (i.e. the weight matrix) V to obtain $\hat{y}_{pc}$:

$$\hat{y}_{pc} = V(x - m) = Vx - Vm \tag{2}$$


To better model complex real-world data, a nonlinear input-output mapping function f is applied to $\hat{y}_{pc}$ to obtain the code $y_{pc}$:


$$y_{pc} = f(Vx - Vm) \tag{3}$$
Setting x = 0 in (3) and (1) and requiring the two codes to be equal gives

$$f(V \cdot 0 - Vm) = f(W \cdot 0 + b) \tag{4}$$


Assuming that the activation function f is strictly monotonic (and therefore one-to-one), (4) gives $b = -Vm$. Substituting this result into (3) yields an expression of the same form as (1):


$$f(Vx + b) = f(Wx + b) \tag{5}$$


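Because f is one-to-one, (5) holds for every x only if W = V. In other words, the principal coefficient matrix V serves as the encoder weights and $-Vm$ serves as the bias, so neither needs to be learned by back-propagation. In this study, we apply the parametric rectified linear unit as the nonlinear activation function, $f(z) = \max(\alpha z, z)$ with $\alpha \in [0, 1]$; note that f is strictly monotonic, as the derivation above requires, only for $\alpha > 0$. Two hidden layers with principal coefficient encoding are developed for multi-layer feature extraction to learn data representations with multiple levels of abstraction (see Figure 1). The intermediate activated coefficients are combined with x and further analyzed in the second hidden layer to encode deeper features.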
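The encoder described above can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the authors' code: the principal coefficients are assumed to come from a standard PCA (SVD of the mean-centred training data), `keep_ratio` mirrors the preserved-dimension fraction studied in Section 3.1, `alpha` is the PReLU parameter α, and "combined with x" is assumed to mean concatenation. All function names are hypothetical.

```python
import numpy as np

def prelu(z, alpha=0.5):
    """Parametric ReLU: f(z) = max(alpha * z, z), with alpha in [0, 1]."""
    return np.maximum(alpha * z, z)

def fit_layer(X, keep_ratio=0.5):
    """Fit one principal coefficient layer (assumed PCA via SVD).

    Returns W (k, d), whose rows are the leading principal directions (W = V),
    and the bias b = -V m, where m is the per-dimension training mean.
    """
    X = np.asarray(X, dtype=float)
    m = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - m, full_matrices=False)  # rows sorted by explained variance
    k = max(1, int(round(keep_ratio * X.shape[1])))        # number of preserved dimensions
    V = Vt[:k]
    return V, -V @ m

def encode(X, W, b, alpha=0.5):
    """Autoencoder-style code y = f(W x + b), applied to each row of X."""
    return prelu(np.asarray(X, dtype=float) @ W.T + b, alpha)

def fit_pcem(X, keep_ratio=0.5, alpha=0.5):
    """Two stacked layers; layer 2 sees [x, layer-1 code] (concatenation assumed)."""
    X = np.asarray(X, dtype=float)
    W1, b1 = fit_layer(X, keep_ratio)
    H1 = encode(X, W1, b1, alpha)
    W2, b2 = fit_layer(np.hstack([X, H1]), keep_ratio)
    return W1, b1, W2, b2

def transform(X, params, alpha=0.5):
    """Map samples to second-layer codes, which would then be passed to the SVM."""
    X = np.asarray(X, dtype=float)
    W1, b1, W2, b2 = params
    H1 = encode(X, W1, b1, alpha)
    return encode(np.hstack([X, H1]), W2, b2, alpha)
```

Because b = −Vm, `encode(X, W, b)` gives the same code as applying f to V(x − m) directly, which is exactly the equivalence between (1) and (3). Under these assumptions, training reduces to two SVDs with no iterative weight updates.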


2.3. Classification 

Owing to its flexibility, SVM can solve a variety of classification problems with minimal tuning. In addition, the automatic complexity control in SVM helps mitigate overfitting. Real-world data that are not linearly separable can be handled with an appropriate kernel trick, which lets a linear method model nonlinear structure. In this work, a nonlinear SVM is employed to generate a decision boundary for classifying the extracted codes. The nonlinear classifier works much like a linear SVM, except that a kernel function represents the similarity of vectors (i.e. codes) in a kernelized feature space over polynomials of the original variables. This enables nonlinear modelling: the kernel implicitly maps the data into a higher-dimensional space, in which a decision hyperplane can be constructed, as illustrated in Figure 3.

The mapping from the original input space into the kernelized feature space is formulated as

$$y \mapsto \Phi(y) \tag{6}$$

The decision function g is then computed on the mapped data:

$$g(y) = w \cdot \Phi(y) + d \tag{7}$$


Computing Φ explicitly for each sample is inefficient: the kernelized feature space can be very high-dimensional or even infinite-dimensional, which makes w hard to represent in memory. Hence, a kernel function K(i, j) = Φ(i) · Φ(j) is used to avoid computing each Φ explicitly [23].
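The kernel trick can be checked numerically on a toy case. The text does not state which kernel or degree is used, so the sketch assumes a homogeneous degree-2 polynomial kernel K(a, b) = (a · b)² on 2-D inputs, whose explicit feature map is Φ(v) = (v1², √2·v1·v2, v2²). It also writes (7) in the standard SVM dual form, in which w = Σ cᵢ Φ(yᵢ) over the support vectors yᵢ (cᵢ being the Lagrange multiplier times the class label), so g(y) = Σ cᵢ K(yᵢ, y) + d never needs Φ. The helper names are hypothetical.

```python
import numpy as np

def phi_poly2(v):
    """Explicit degree-2 feature map for a 2-D vector (assumed kernel)."""
    v1, v2 = v
    return np.array([v1 * v1, np.sqrt(2.0) * v1 * v2, v2 * v2])

def k_poly2(a, b):
    """Homogeneous polynomial kernel K(a, b) = (a . b)^2, computed without Phi."""
    return float(np.dot(a, b)) ** 2

def decision_function(y, support_vectors, coefs, d, kernel=k_poly2):
    """Dual form of (7): g(y) = sum_i c_i K(y_i, y) + d, where c_i = alpha_i * t_i."""
    return sum(c * kernel(sv, y) for sv, c in zip(support_vectors, coefs)) + d

a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])
print(np.dot(phi_poly2(a), phi_poly2(b)), k_poly2(a, b))  # both ≈ 1.0, i.e. (1*3 + 2*(-1))^2
```

The explicit map has only three dimensions here, but for higher degrees or an RBF kernel the feature space grows quickly or becomes infinite, which is why the kernel form is preferred.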


Figure 3. Nonlinear SVM mapping (panels: original input space; kernelized feature space with separating hyperplane)
3. RESULTS AND DISCUSSION

To evaluate the classification performance and generalization ability of the proposed PCEM, we test it on the human static-postures and dynamic-activities (HSD) database. This database contains three static postures (sitting, standing and laying) and three dynamic activities (walking, walking upstairs and walking downstairs). During data collection, thirty volunteers (aged 19-48 years) wore a Samsung Galaxy S II smartphone, which has an accelerometer and a gyroscope, on the waist. Total acceleration, estimated body acceleration and gyroscope data were captured at 50 Hz. The data collection process is described in detail in [2]. Four performance evaluation metrics are employed: classification accuracy, precision, recall and F1-score. The experiments are conducted on an Intel Core i7-7700K CPU (4.2 GHz) with 48 GB of RAM, using MATLAB R2018a.
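The four metrics can be computed from a confusion matrix as sketched below. This is a generic illustration rather than the evaluation script used in the experiments; it assumes integer class labels 0..C−1, and `classification_metrics` is a hypothetical helper. Precision and recall are computed per class, and F1 is their harmonic mean.

```python
import numpy as np

def classification_metrics(y_true, y_pred, n_classes):
    """Accuracy plus per-class precision, recall and F1 from integer labels."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1                      # rows: true class, columns: predicted class
    tp = np.diag(cm).astype(float)
    predicted = cm.sum(axis=0)             # samples assigned to each class
    actual = cm.sum(axis=1)                # samples truly in each class
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    accuracy = tp.sum() / cm.sum()
    return accuracy, precision, recall, f1, cm
```

Classes that are never predicted (or never present) get a precision (or recall) of 0 here instead of a division-by-zero error; other conventions exist.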


3.1. Parameter performance analysis 

The parametric rectified linear unit is introduced in the principal coefficient encoder as a nonlinear input-output mapping to better model complex real-world data. In this experiment, the influence of the activation function parameter α is examined. Figure 4 illustrates the performance measures for different α values. The system performance is fairly stable across α values, except for α = 0.1, which yields about 1% lower accuracy. We adopt α = 0.5 for the subsequent experiments.
Figure 4. Performance of different α


Next, the performance of PCEM with different dimension reductions in the hidden layers is examined, as illustrated in Figure 5. The empirical results show that performance improves as the number of preserved dimensions increases, up to a point. Accuracy rises markedly from 94.4% when only 20% of the dimensions are preserved to 96.8% when 50% are preserved. However, when excessive dimensions are preserved (>60%), i.e. when the code is higher-dimensional, PCEM shows a slight accuracy degradation. This could be attributed to uninformative components (i.e. noise) in the code, which increase misclassification: real-world data contain correlated features, and the redundant information acts as noise that can adversely affect the classification model. A reasonable dimensionality reduction in PCEM (50% in this case) helps eliminate irrelevant and redundant data, producing effective codes for classification. Hence, a 50% dimension reduction is employed.


3.2. System performance analysis 
We examine the efficiency of PCEM in two modes: i) two-class classification, in which human activities are classified by intensity level as active or passive, and ii) multiclass classification, in which human activities are classified into one of C possible classes. Activities such as walking, climbing stairs and jogging exhibit periodic behavior patterns generated by repetitive movements. In the two-class mode, we group the activities with periodic movement patterns (walking, walking upstairs and walking downstairs) into an active class, whereas sitting, laying and standing form a passive class. As mentioned earlier, PCEM is assessed as a subject-independent solution. In other words, PCEM is trained on samples from one group of users and then applied to new users without collecting additional samples from them to retrain the model. In this experiment, the HSD dataset is partitioned into two sets: the data of 70% of the volunteers are used for training and the data of the remaining 30% for testing. PCEM needs 10.0275 s to train the model on 7,352 samples and 1.0724 s to test 2,947 samples. Table 1 summarizes the performance of PCEM in two-class and multiclass classification. The proposed model excels in classifying active and passive activities, achieving 100% accuracy. Its precision, recall and F-score are also all 1.000, which indicates that the model produces no false positives or false negatives when distinguishing active from passive activities. Figure 6 illustrates the confusion matrices of the two modes: there is no misclassification between active and passive activities, as shown in Figure 6(a), and little misclassification among activity classes, as shown in Figure 6(b).
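The subject-independent split and the two-class relabelling can be sketched as follows. The assignment of volunteers to the 70%/30% sets is not listed in the text, so the sketch simply shuffles subject IDs with a fixed seed; the activity label strings and function names are hypothetical. The key property is that the split is made over subjects, not over samples, so no volunteer contributes to both sets.

```python
import numpy as np

ACTIVE = {"walk", "upstair", "downstair"}   # periodic, dynamic activities
PASSIVE = {"sit", "stand", "lay"}          # static postures

def to_two_class(labels):
    """Map six activity names to 'active' or 'passive'."""
    out = []
    for label in labels:
        if label in ACTIVE:
            out.append("active")
        elif label in PASSIVE:
            out.append("passive")
        else:
            raise ValueError(f"unknown activity label: {label!r}")
    return np.array(out)

def subject_split(subject_ids, train_fraction=0.7, seed=0):
    """Return boolean train/test masks with no subject shared between them."""
    subject_ids = np.asarray(subject_ids)
    subjects = np.unique(subject_ids)
    rng = np.random.default_rng(seed)
    rng.shuffle(subjects)                             # random order of volunteers
    n_train = int(round(train_fraction * len(subjects)))
    train_mask = np.isin(subject_ids, subjects[:n_train])
    return train_mask, ~train_mask
```

With 30 volunteers, `train_fraction=0.7` selects 21 for training and 9 for testing, matching the 70%/30% partition described above.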
Figure 5. Performance of different dimensions in the hidden layers


Table 1. Performance of PCEM in two-class and multiclass classifications

Mode                        Class       Precision   Recall   F-score   Overall accuracy (%)
Two-class classification    Passive     1.000       1.000    1.000     100
                            Active      1.000       1.000    1.000
                            Average     1.000       1.000    1.000
Multiclass classification   Stand       0.904       0.976    0.939     96.8103
                            Sit         0.971       0.884    0.925
                            Lay         1.000       1.000    1.000
                            Walk        0.965       0.998    0.981
                            Downstair   1.000       0.974    0.987
                            Upstair     0.983       0.975    0.979
                            Average     0.969       0.968    0.968
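As a consistency check on Table 1, each per-class F-score should be the harmonic mean of the reported precision and recall, F = 2PR / (P + R). The short script below recomputes the multiclass rows from the table values; the names `TABLE1_MULTICLASS` and `check_table` are illustrative only, and agreement is expected to within the table's three-decimal rounding.

```python
# Per-class (precision, recall, reported F-score) in multiclass mode, copied from Table 1
TABLE1_MULTICLASS = {
    "Stand": (0.904, 0.976, 0.939),
    "Sit": (0.971, 0.884, 0.925),
    "Lay": (1.000, 1.000, 1.000),
    "Walk": (0.965, 0.998, 0.981),
    "Downstair": (1.000, 0.974, 0.987),
    "Upstair": (0.983, 0.975, 0.979),
}

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)

def check_table(rows, tol=5e-4):
    """Names of rows whose recomputed F1 differs from the reported value by more than tol."""
    return [name for name, (p, r, f_rep) in rows.items() if abs(f1(p, r) - f_rep) > tol + 1e-9]

for name, (p, r, f_rep) in TABLE1_MULTICLASS.items():
    print(f"{name:10s} recomputed F1 = {f1(p, r):.3f}  reported = {f_rep:.3f}")
```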


The performance of PCEM drops slightly in multiclass classification, where the model obtains approximately 97% accuracy in classifying the different types of activities. Although not as perfect as the two-class result, this is still encouraging. PCEM records lower F-scores for the stand and sit classes, indicating more false positives and false negatives between these two classes, as illustrated in the confusion matrix. The similar, nearly constant signal patterns of these stationary activities may be the reason for the misclassification. Nevertheless, PCEM is still able to extract the underlying distinct patterns of the data, with only minor false positives and negatives.
Figure 6. Confusion matrix of PCEM in (a) two-class and (b) multiclass activity classifications


3.3. Comparison and discussion 

In this experiment, we compare the performance of other models with that of the proposed PCEM. For a fair comparison, the same database and the subject-independent protocol are used. Table 2 shows the classification accuracy of various HAR approaches on the HSD database.

From the empirical results, we observe that: i) the proposed PCEM outperforms the other approaches. We attribute this to the optimized principal coefficient learning and the discriminative classifier in PCEM, which analyze the motion features from lower to deeper levels via the stacked layer-wise architecture; ii) CNN and the deep belief network (DBN) perform slightly worse than PCEM, achieving 95.75% and 95.80% accuracy, respectively. This could be due to insufficient training samples to generalize and optimize model learning. Besides, CNN and DBN suffer from low training efficiency [14], as their computational complexity grows with the number of layers. In contrast, PCEM is a fast analytic solution, requiring no back-propagation and no laborious parameter tuning; iii) in this HAR application, the stacked autoencoder does not perform well, probably because the number of training samples is insufficient for it to learn full features through encoding; and iv) PCEM is a promising solution for HAR with minimal training effort (no iterative learning or back-propagation). This analytic solution can be trained on a CPU alone, without a GPU.


Table 2. System comparison of the existing approaches in HAR 


Approach                                             Accuracy (%)
CNN* [24]                                            95.75
ANN* (reported in [24])                              91.08
GCHAR* [9]                                           94.16
Deep Belief Network* (reported in [17])              95.80
Hierarchical Continuous Hidden Markov Model* [25]    93.18
Stacked autoencoder* [16]                            89.64
PCEM                                                 96.81


* Results are extracted from the respective papers 


4. CONCLUSION 

Inspired by the autoencoder neural network, a layer-wise network called PCEM is presented. The core difference between the autoencoder and PCEM is that the former initializes the weights with random values and applies back-propagation for parameter/weight updates, whereas the latter implements a principal coefficient weight initialization that allows rapid data learning with no back-propagation. Unlike other classic deep neural networks (e.g. CNN), PCEM requires no extensive hyperparameter tuning. In this work, a subject-independent testing protocol is implemented to evaluate the performance of PCEM; this protocol better assesses how well the model generalizes to new data. Empirical results demonstrate the superiority of PCEM over the other machine learning and deep learning models compared, with a recognition accuracy of ~97% in activity recognition. Furthermore, PCEM trains in less than one minute (about 10 s on the HSD training set), whereas classic deep networks can require hours of training due to complex computation and extensive parameter tuning.